Assume I have a data set of $n$ cases of $p$ predictor variables and one response variable. The relationships between the predictors and the response are all approximately linear. In the future, I will get new cases of the predictor variables, and I hope to predict their corresponding response values as accurately as possible. I’d like to consider three popular prediction methods: subset selection, ridge regression, and lasso regression. Which one is likely to give me the best predictions for my future data?
One intuitive approach is to split my original data randomly into training and test groups and see how well each method predicts the test data from the training data.
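A minimal sketch of one such split in R; the 70/30 proportions and the built-in USArrests data are illustrative choices, not part of the original analysis:

```r
# Randomly assign roughly 70% of the rows to training and 30% to test.
set.seed(1)                                       # reproducible split
dat <- USArrests                                  # illustrative data set
train.idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[train.idx, ]
test  <- dat[-train.idx, ]
```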
(For information on these and other prediction methods, I recommend The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, an outstanding textbook that can be downloaded for free. It also discusses the estimation of prediction error.)
compareMSPE
To choose a “best” prediction method, we need to decide what exactly we want to optimize. Let’s assume that we want to minimize mean squared prediction error (MSPE).
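In symbols, if $\hat{f}$ is a prediction rule fit to the training data and $(x_0, y_0)$ is a future case, then

$$\mathrm{MSPE} = E\big[(y_0 - \hat{f}(x_0))^2\big],$$

which can be estimated by the average squared error over $m$ held-out test cases:

$$\widehat{\mathrm{MSPE}} = \frac{1}{m} \sum_{i=1}^{m} \big(y_i - \hat{f}(x_i)\big)^2.$$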
I wrote a function compareMSPE to perform a sensible MSPE estimation process for any given set of prediction methods.
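The actual compareMSPE is not reproduced here, but a minimal sketch of the idea might look like the following, assuming each prediction-method helper is a function of a training data frame, a test data frame, and the response name that returns test-set predictions:

```r
# Sketch of a compareMSPE-style loop (an assumption, not the author's code):
# repeatedly split the data, let each method predict the held-out cases,
# and record the mean squared prediction error per split and per method.
compareMSPE.sketch <- function(data, response, methods,
                               n.splits = 50, train.frac = 0.7) {
  mspe <- matrix(NA, nrow = n.splits, ncol = length(methods),
                 dimnames = list(NULL, names(methods)))
  for (s in seq_len(n.splits)) {
    idx   <- sample(nrow(data), size = round(train.frac * nrow(data)))
    train <- data[idx, , drop = FALSE]
    test  <- data[-idx, , drop = FALSE]
    for (m in seq_along(methods)) {
      pred <- methods[[m]](train, test, response)        # test-set predictions
      mspe[s, m] <- mean((test[[response]] - pred)^2)    # squared-error average
    }
  }
  mspe                                                   # n.splits x methods matrix
}
```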
By default, compareMSPE compares subset selection, ridge regression, and lasso regression. To use compareMSPE for other methods, you will probably need to write your own helper functions, following the examples of best.subset, best.ridge, and best.lasso.
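For example, ridge and lasso helpers could be sketched with the glmnet package as follows. These are hedged illustrations using the same assumed interface as above, not the actual best.ridge and best.lasso:

```r
library(glmnet)   # cv.glmnet() fits ridge/lasso with cross-validated lambda

# Shared machinery: build numeric matrices, pick lambda by cross-validation,
# and return predictions for the test rows.
glmnet.sketch <- function(train, test, response, alpha) {
  x.train <- as.matrix(train[, setdiff(names(train), response)])
  x.test  <- as.matrix(test[,  setdiff(names(test),  response)])
  cv <- cv.glmnet(x.train, train[[response]], alpha = alpha)
  as.numeric(predict(cv, newx = x.test, s = "lambda.min"))
}
ridge.sketch <- function(train, test, response) glmnet.sketch(train, test, response, alpha = 0)
lasso.sketch <- function(train, test, response) glmnet.sketch(train, test, response, alpha = 1)
```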
The default methods (subset, ridge, and lasso) all assume that the relationships between the response and the predictor variables are approximately linear. The pairs function is a convenient way to judge whether this assumption is reasonable. When a relationship is nonlinear, you can often transform a variable to get something close to linearity.
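For example (using the built-in USArrests data purely as an illustration):

```r
# Scatterplot matrix: look for roughly linear point clouds between the
# response and each predictor.
pairs(USArrests)

# If a relationship looks curved, a transformation often helps, e.g. a log:
dat <- transform(USArrests, logUrbanPop = log(UrbanPop))
```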
Otherwise, you could write functions for nonlinear prediction methods, such as k-nearest-neighbors regression, and pass those functions into compareMSPE.
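A hedged sketch of such a helper, using knn.reg from the FNN package and the same assumed interface as the sketches above:

```r
library(FNN)   # knn.reg() performs k-nearest-neighbors regression

# Hypothetical kNN helper: standardize predictors (kNN is scale-sensitive),
# then predict each test case from its k nearest training neighbors.
knn.sketch <- function(train, test, response, k = 5) {
  x.train <- scale(train[, setdiff(names(train), response)])
  x.test  <- scale(test[,  setdiff(names(test),  response)],
                   center = attr(x.train, "scaled:center"),
                   scale  = attr(x.train, "scaled:scale"))
  knn.reg(train = x.train, test = x.test, y = train[[response]], k = k)$pred
}
```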
The function requires a numeric response variable. The requirements on the predictor variables depend on the prediction-method functions that compareMSPE calls; for example, some methods may require categorical predictors to be coded as dummy variables.
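For instance, model.matrix() is one standard way to dummy-code a factor; the tiny data frame below is invented purely to show the mechanics:

```r
# Dummy-code a categorical predictor with model.matrix().
df <- data.frame(y     = c(2.1, 3.4, 1.8, 2.9, 3.1, 2.2),
                 group = factor(c("a", "b", "c", "a", "b", "c")))
x  <- model.matrix(~ group, data = df)[, -1]   # drop the intercept column
x                                              # columns "groupb" and "groupc"
```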
Below, you will notice several utility functions in the mix as well. The only one that compareMSPE itself requires is which.response; the default prediction methods require the others. If you supply your own prediction methods, you may not need some of these other utility functions.
Finally, let’s run compareMSPE on a data set to see which method performs best.
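As an illustration only (not the author's original call), the sketches above could be run on the built-in USArrests data, predicting Assault; a subset-selection helper is omitted here for brevity:

```r
# Illustrative run of the sketches defined above (data set, methods, and
# number of splits are assumptions).
set.seed(1)
methods <- list(ridge = ridge.sketch, lasso = lasso.sketch, knn = knn.sketch)
mspe <- compareMSPE.sketch(USArrests, response = "Assault", methods = methods)
colMeans(mspe)   # average test-set MSPE for each method
```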
Let’s look at boxplots to get a feel for the distributions of the prediction errors for the various methods that were tried.
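With the sketch above, the per-split errors could be plotted like this (assuming mspe is the matrix returned by compareMSPE.sketch):

```r
# One box per method, summarizing the distribution of test-set MSPE
# across the random training/test splits.
boxplot(mspe, ylab = "Test-set MSPE", main = "Prediction error by method")
```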
For this data set, ridge and lasso regression appear to be about equally good at predicting assault, and both are much better than subset selection.
Caution
Use this function in moderation! In general, don’t rely on it, or on other “all powerful” functions, to do all the work for you. The most effective data analysis requires extensive thought and visualization. The more you manually play around with your data, the more likely you are to discover irregularities or come up with insights. Then again, your time is valuable. Use automation with these trade-offs in mind.